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Abstract 

Many inference problems involving questions of optimality ask for the maximum 
or the minimum of a finite set of unknown quantities. This technical report derives 
the first two posterior moments of the maximum of two correlated Gaussian variables 
and the first two posterior moments of the two generating variables (corresponding to 
Gaussian approximations minimizing relative entropy). It is shown how this can be 
used to build a heuristic approximation to the maximum relationship over a finite set 
of Gaussian variables, allowing approximate inference by Expectation Propagation on 
such quantities. 

1 Introduction 

Many optimization problems involve inference on the maximum or minimum of a set of vari- 
ables. This very broad class includes shortest path problems Burton and Toint 1992 , Rein- 



forcement Learning Dearden et al. 1998 



Denzau and Bchrcns 



1984 



and scientific inference in Seismology |Neumann-| 
to name but a few. Often, there is a corresponding inverse 

where the optimal solution 



Ahuja and Orlin 2001 Heuberger 2004 



optimization problem 

is known with some uncertainty and the question is about the quantities generating this opti- 
mum. Most contemporary algorithms for this case aim to provide a point estimate (typically 
the least-squares solution), but have trouble offering an error estimate on this estimate as 
well. 

This work derives (Section |2| mean and variance of the posterior of the maximum of two cor- 
related Gaussian variables (for forward optimization problems), and the mean and variance 
on the posterior of the Gaussian variables generating the maximum (for inverse optimization 
problems) . These two moments correspond to the approximation within the exponential fam- 
ily of Gaussian distributions minimizing the Kullback-Leibler Divergence (relative entropy) 
to the true posterior. It will be shown how these results can be used to build a heuristic 
approximation to the max of a finite set of normal variables (Sec tion [3| . Together, this 
provides the necessary results for Expectation Propagation Minka[ |2001| on graphs involv- 
ing the "max" relationship. Because maximum and minimum obey the simple relationship 
max({ii}) = — min({— Xj}), this also allows inference on the minimum where necessary. 
Limitations of this approximation are examined in Section [4] 

The moments of the normalized likelihood function of the maximum of two normal variables 
have previously been derived by |Clark] [1961] . To my best knowledge, this is the first publi- 
cation deriving the full posterior, and the first to report the posterior for the inverse problem 



(see also Section 2.4.3) 



2 The Maximum of Two Gaussian Variables 
2.1 Notation 

We consider two normally distributed variables X\ and X2, forming the vector x. Let there 
be some prior (i.e. outside) information 3 g giving rise to the belief 



X. Jn 



1 



Dl ' ^ " ~ (1) 

over their values. Here we have defined a mean vector fi g = (p B i,p- B 2) T and a covariance 
matrix S g . The latter has the form 



°"ai P cr 0i cr g2 

pa fl iCr fl2 Cg2 



and thus 



^2(1 -P 2 ) 



"fl2 



-pcr fl lCT fl 2 



with the linear coefficient of correlation 



cxw(x\, X2) 

cr fll (7 B2 



(2) 



(3) 



(for notational convenience, the index g is dropped from p because there will be no chance for 
confusion). We further introduce the variable m which is defined through m = max(a;i, X2), 
and we assume that there is some outside prior information 3 m on the value of m as well: 



p(mpm) = A/"(m; fj, m , cr^) 
The inference problems to be solved are 

► The posterior over m given both 3 m and 3 S (jointly called 3 C ): 
p(m|3 m ) / p(x\m)p(x\3 B )dx 



(4) 



p(mp c ) = 



/ [p(m|3 m ) Jp(jc|m)p(a;p B ) da;] dm 



Z ^(mpn,) J p(x\m)p(x\3 B ) dx 



( 5 ) 

with the normalization constant Z = JJ p(x, m\3 c ) dx dm. This problem will be called 
the "forward" problem here. 

► The posterior over x given 3 C , 

p(x\3 g ) J p(m\x)p(m\3 m ) dm 



p{x\3 c ) = 



I [p(x\3 s ) J p(x\m)p(m\3 m ) 



dw -| dx = Z~ 1 p(x\3 g ) j p(m\x)p(m\3 m ) dm 

(6) 



This problem will be called the "inverse" problem. 
Throughout the derivations, the notation 



Af(x; u ) = 

4>{x) = 

<&(x) = 



V2ircr 2 



exp 



1 / x — jl 
2 



exp 



<j>(t) dt 



(7) 



1 + erf 



(x/2 



will be used to denote the general and standard normal probability density functions (PDF) 
and the standard normal cumulative distribution function (CDF). 

2.2 Some Integrals 

The derivations in this paper will repeatedly feature certain integrals. The first two incom- 
plete moments of the standard Gaussian are 



/' 

J — ( 

f 

J — cx 



t<t>{t) dt = -<f>(y) 



t 2 <j>(t)dt = <f>(y)-y(! > (y) 



(8) 



this is obvious directly from differentiation. A simple substitution gives 
rV 



I 

J — ( 

/ 

J — ( 



W(t;a,p 2 )dt - 



t 2 M(t;a,(3 2 )dt =(a 2 + /3 2 )$ 



y-a 

p 

y-a 



-{a + y)f3<j> 



y — a 

y-g 



(9) 
(10) 



m,3 m 



2 



Further, we will use the integrals 



x$ 



oo 
oo 



x — a 



M(x;a,p 2 )dx = $(z) 



Af(x; a, p 2 ) dx = a$(z) + 



ft 2 



b^i + p 2 /v- 



4{z) 



x*$ 



Af(x; a, p 2 ) dx = {a 2 + /3 2 )$(z) + 



2 a 



P 2 



by/1 + /P /b 2 b 2 + p 2 



(z) 



where 



by/l + /3 2 /b 2 



, , (11) 

A derivation of these results can, for example, be found in Rasmussen and Williams] |2006| 
section 3.9] 

2.3 Analytic Forms 
2.3.1 Forward Problem 

Neither of the posterior distributions are normal themselves. The forward posterior is 
p(m\3 c ) = Z~ l p(m\3 m ) / / p(x\m)p(x\3 a )dx 



Z 1 p(m\3 m ) 



S(xi — m)p(x\3 g ) dx2 + / S(x2 — m)p(x\3 g ) dx 



dxi 



/OO r-X\ 
5(xi—m) / p(x\3 g ) dx 2 dxi 
-oo J —oo 

v v ' 

v\ 

/OO PX2 
8(x 2 -m) / p(x\3 B ) dxi dx 2 
-oo J —oo 



. (12) 

For a motivation of the change in the integration ranges from the second to the third line in 
Equation (12 1, consider the sketch in Figure [I] Since the two summands are related to each 




Figure 1: Sketch of the integration range for v 2 . The open set (xi,x 2 ) £ ((— oo, oo), (x\, oo)) is 
identical to the open set (xi,x 2 ) € ((— oo, x 2 ), (— oo, oo)). 

other through the symmetry x\ «-> x 2 , consider only the first term, v\. To solve the integrals, 
note that the bi-variate Gaussian p(x\3 g ) can be re-written as 

p(xi,x 2 \3 g ) = p(x 1 \3 a )p(x 2 \xi,3 s ) 



1 



2na 2 sl 
1 



: exp 



1 / xi - Mgi 

2 <J gl 



^2^(1 p 2 ) 



exp 



1 



^ 2 g2 (l-p 2 ) 



x 2 - ( p g2 + p— (x 1 - ju fl i) 



(13) 



3 



So we can simplify v\ to 
V\ = p(m\3 m )jV(m; Mol, o^i) 

= p(m\3 m )J\f(m; Meii 
We introduce the substitution 



1 



2™ 2 2 (1 - p 2 ) 
1 

2™ 2 2 (1 - p 2 ) 



: exp 



2(1 - p 2 ) V ^ B 2 



^2 - M B 2 _ p m ~ Mfll 



°0l 



dx 2 



: exp 



1 /x 2 -/i B 2 -P^Ora-Msi)' 

2 I a fl2 (l-p2)i/2 



dx 2 



^2-p B 2-p^(m-p Bl ) d< 
i(x 2 ) = - ON , with Jacobian 



1 



^(l-p 2 ) 1 / 2 dx 2 a 2 (l-p 2 )V2 

Which allows us to solve the integral and find the posterior up to normalization 



(14) 
(15) 



p(m\3 c ) = Z 1 7V(^ m ;^ Bl ,cr 2 1 + CT 2 1 )7V(m;/i cl ,cr 2 1 )$ 



a s ia B2 (l-p 2 )i/2 



+ Z 1 M{[i m \[i ? , 2 ,a 1 m +cr 2 2 )7V(TO;^ c2 ,cr 2 2 )$ 



2 \* /" ( CT fl2 - pcr fl i)m - cr B 2/i B i + pcr B i/i B 2 



a B2 a Bl (l-p 2 )i/2 



Where we have used the abbreviations 



2 _. <jl*m 
cl — _2 , _ 2 



and /x c i 



^ + ^ U 2 

, CT B 1 CT m 



cl 



(16) 



(17) 



Pc,er 2 



for the mean and variance of the product of two Gaussian^ and analogously for /i c2 and 
er c2 . To find the normalization constant Z, we use the first identity in Equation (11 1 to get Z 



Z = JV (/i m ; /i B i, a\ + <7^)$(fci) + Af(fi m ; M B 2,^m + cr 2 2 )$(fc 2 ) 



(19) 



with 



ki,k 2 



k = (gfll - P°' B 2)Pcl - gBlMflg + Pg B 2P B l ^ , _ (CT B 2 - pC r B l)Mc2 - 0- B2 /i g i + pcr B i^ B2 

- - 1/2 



Ka^C 1 - P 2 ) + (agy - Pv S 2) 2<j2 i] 



Kl^ 2 (l - P 2 ) + (^ B 2 - P^l) 2 ^] 

(20) 



2.3.2 Inverse Problem 

The conditional probability of x on m is 

p(xi 7 x 2 \m) — 6(xi — x 2 )<5(xi — m) + 6(x 2 — Xi)S(x2 — m) 
where 0(y) is Heaviside's step function. Therefore, the conditional of x on 3 m is 

p(x 1} x 2 \3 m ) = I p(m\x 1 ,x 2 )p(m\3 m )dm 



(21) 



(22) 



= 6(xi - x 2 )Af(xi; Hm,(?m) + @(x 2 - xi)7V(x 2 ; p m , cr^) 



which is a proper (i.e. normalizable) distribution, but becomes normalizable after multipli- 
cation with the prior: 



p(x\3 c ) = Z 1 Q(xi - x 2 )A r (xi;/i ra , tr 2 )A/"(a;; /x g , S fl ) 



+ Z 1 6(x 2 - xi)jV(x 2 ;Mm, ^mV^C 35 ! A* B ' S s) 



(23) 



Figure [2] illustrates the shape of these functions by way of some concrete examples. 
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Figure 2: Illustrative plots for the analytical form of the forward and inverse posteriors. Left: 
Inference on m. Prior distribution and marginals on x^. Posteriors for five different values of p: 
-0.9 (most peaked), -0.5, 0.0 (thick line), 0.5 and 0.9 (broadest). As an experimental verification, 
a histogram of 20,000 samples from the posterior (generated by rejection sampling, with p = 0) is 
shown in blue. Right: Inference on the inverse problem: Prior with p g = (1,1) T , er g i = er g 2 = 1 
and p = —0.5. Data on m with p, m = 1, <r m = 1 gives the posterior in red. Note the bimodality 
arising in this particular case. 



I[m = max(xi, £2) 



cov(xi,x 2 ) = p 




Xi 




X2 



Figure 3: Factor graph representation of the functional relationships in the inference problems 



2.4 Moment Matching 

The analytical forms derived in the preceeding sections are clearly not members of the normal 
exponential family. If x has more than two elements, they also quickly take on complicated 
forms that are expensive to evaluate. If the application in question allows, it might thus be 
desirable to find Gaussian approximations to the posteriors. This section contains derivations 
for the first two moments of both posteriors. The Gaussian distributions q matching these 
moments minimize the Kullback-Leibler divergence -Pkl(p1 1<?) = ,f p{y) log(p(y)/g(j/)) dj/ to 
the correct posterior p within the Gaussian family [see, e.g. Bishop) 2006[ Section 10.7]). 



2.4.1 Forward Problem 



We will denote the mean and variance of the posterior of the max as /t* m (i2) and 0^ n ( 12 ) f° r 
reasons that will become clear in Section [3] The corresponding integrals to solve are 



(m) = /i m (i2) = / mp(m\3 c ) dm = Z 1 / m(vi + v 2 ) dm 
J —00 J 

/>oo 

(m) 2 = crL 12 - / m 2 p{mp c )dm 



(24) 



This is using the standard result that 
jV{x ] a 1 ,bl)N(x;a 2 ,b%) =N{a 1 ;a 2 ,b 2 1 +b%)Af 
which can be derived by completing the square, a simple proof that is omitted here 



1 



Oi Q2\ (}_,}_ 
' 1 b\ b\) \b\ + b\) '\b\' b\ 



(18) 



5 



Comparison with Equation (16 1 shows that these two integrals are solved by Equation (111. 

b 2 4>(k 2 )' 



The solutions are thus, after some algebra, 

h <f>(ki) 



pm(12) — w l 



Mel + CTcl 



w 2 



Pc2 + 0"c2 



°"m(12) = I [Mcl + 

+ W2 <^ [a4 + 0"c2] 



2Ai c io- c i— - k\a 2 a 
ai 



a 2 $(fc 2 ) 
foil 



(25) 



O 62 L 2 & 2 n 
^Mc2CT c 2 K 2 a c2 ^ 

a 2 a 2 



0(fc 2 ) 1 



2 

Mm(12) 



where 

Wi = Z~ 1 Af(fj, m ; Mfli,^ + ct 2 )$(/ci) 



Wi,ai,b z 



(26) 



W 2 = Z 1 7V(Ai m ;/i 8 2,^ + ^2)^(^2) 

fl l = kfll^C 1 - P 2 ) + ( CT fll - P cr fl2) 2 ^c 2 l] V2 «2 = [o-fllO-g 2 (l - P 2 ) + (°" 2 - PO-0l) 2 CT^ 2 ] V2 

(27) 

fol = CT c i(cr g i - pcr g2 ) 6 2 = (J c2 ((J g2 - p(? g l) (28) 

2.4.2 Inverse Problem 

The derivation for the inverse problem is just slightly more involved. We are interested in the 
moments of the marginals p(xi\3 c ) and p(x 2 \3 c ), and will denote these means and variances 
with /ii( m 2), cr i( m 2)j e t cetera. From Equation (23 1, we get 

/oo px-i 
xi / Af(xi; fi m ,a^)Af(x; (i g ,'E B )dx 2 dxi 
-00 J — 00 

-OO J — OO 



(29) 



The first integral is in fact identical to the first term of p, m (i 2 ). The second term, however, 
involves the first incomplete moment: 



/OO rX 2 
j x 1 Af(x 2 \ p mi a 2 m )N{x] /x fl ,S fl )dxida:2 
-OO J — OO 



U{x 2 \p m ,(r 2 n )U{x 2 ;p g2 ,o- 2 2 ) I xiM 



xi; n s i + p—(x 2 - Htf), o- i(l - p 2 ) 



dxi dx 2 
(30) 



The inner integral can be solved using the result given in Equation (10 1, leading to an 
expression solved by Equation (ITT]) . After a bit of algebra, we arrive at the final result 



Pl(m2) = Wi 



Mel + c c i 



6l 0(fcQ 

01 



W 2 



, °fll / s\ , ^ ^(^2) 

M B i + 9— Mc2 - Mfl2 + — ^tttt 



i 2 foil 0(fel) 



foi 



W 2 < cr 



01 



°0l cr fl2 

c 



+ (l-p 2 )+p 2 



(31) 



Ml + CTy/l 2 ) 1 ^ ^3( 1+CT 2 2/ ^2)3/2 



'02 



Mi 



where 



A = P<J 2 2 a gl (l - p^-J - a 2 sl a s2 (l - p 2 ) 



B = 2p 2 -^-o- 2 2 (p c2 - n a2 ) + p— I 2cr 2 2 ^ 8 i + h b2 



Bl 



'02 



CT B2 



^1^2(1 -p 2 ) 

o- g2 - p<J g i 



MfllCT 2 l(l-p 2 )- 



°02 



CT B 2 - PCT B 1 



C = P 2 ^^^ - /) + <7»(1 - p 2 ) ( 1 + ) gfl2 (Mc2^ + Z^) 



'02 



^02/ 0-02 - PC B i 



(32) 



G 



with 



and h = " B1 " BZ ^ r - ' (33) 

o- B 2-po- fl i cr g2 ~pa gl 

The corresponding result for the posterior marginal on x 2 can be derived trivially from these 
results by exchanging the indices 1 and 2. Note that, as mentioned above, the first terms 
of these mixtures are shared with the posterior for to. Intuitively, this can be interpreted 
as follows: For the posterior on to, the first term (vx) corresponds to the statement that 
"if x\ > x 2 (the probability of this is encoded by the cumulative density term in Equation 
(16l) "then m is distributed like xi' (represented by the product of the probability density 
functions in ( |16[ )). This part of the relationship features in the inverse problem as well: 
If x\ > x 2 , then x\ is distributed like to. The second term in the posterior marginal on 
X\ corresponds to the statement that "if X\ < x 2 , then x 2 is distributed like to and X\ is 
distributed such that its distribution fits with the updated marginal of x 2 given the correlation 
between X\ and x 2 and the prior marginal on x\. 



2.4.3 Related Work 



The moments of the likelihood of the max have been derived before by Clark 1961 . That 
is, for er m — > 00, the posterior p(m\3 c ) reported here simplifies to a result reported by Clark: 



Mm(12) 



i(12) 



a $(jfe) 



$(-k) 



P 2 + v B 2 



S(fc) [p 2 gl + a 2 gl ] 



2M B i tJ B i- 



(<7 8 l - /?(T B 2) _ fc ^ 2 ( CT fll ~ P cr B 2) 5 



'fll" 



(cr fl2 - pcr fll ) (ft(-fc) 
a $(-&) 

$(fc)j 



[M B 2+^ 2 2] 



Mm(12) 



2p s2 &g 2 



and 



(cr g2 - pcr 8 l) 



Mfll - M B 2 



' B 2- 



((T B 2 - /JCTgl) 2 



0(-fc) 

$(-fc) 



where a = + a£ 2 - 2pa gl a g2 

(34) 

As expected, the posterior of the inverse problem simply becomes equal to the prior in this 



case. From Equation (31 1 we find 



Pu m2 ) -> $(&)Mi + — <A(fc) + *(-^)Mi - vi— — W-fy 

= $(fe) M i + at * 1 ~ pa2 4>(k) + (1 - $(fc)) Ml - ——————— 0(A:) 



(35) 



= Mi 

and similarly for the variance. 

The max-factor is also part of the Infer.NET software package Minka and Winn 2008] (to 
my knowledge, the derivations for this code have not been published yet). However, their 
implementation can only handle two independent Gaussian inputs (Section [3] introduces the 
max over a finite set of correlated variables). So their implementation corresponds to the 
case of p = 0, which leads to the following simplifications, presented here for reference: 



ki = 



Pel - Pg 2 



f = Pal 



O-l = C B i(cr B i + 0"c2) 

B = -p g i<J 2 g i 
h = er fll 



1/2 



Jl (.. _2 



(36) 



C = cr gl (p c2 a gl + p g i<7 c2 ) (37) 

(38) 



Figure [4] shows some of these approximations. The parameter settings used in this figure 
represent a worst case (e.g., the posterior over x is rarely so strongly bimodal.) 



3 The Maximum of a Finite Set 



3.1 Analytic Form 



Extending the analysis of Section 2.3 we can write the posterior over the max m of a finite 
set {:Ej}i=i,...,jv of variables, distributed according to an TV-dimensional version of Equation 
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0.0 0.5 1.0 1.5 2.0 2.5 3.0 



Figure 4: Illustrative plots for the Gaussian approximations to the posteriors. Same beliefs in 3 C 
as in Figure [2] Left: For the sake of readability, only the cases p = —0.9 (broadest), p = and 
p = 0.9 are plotted here. In red, dashed lines the corresponding three Gaussian approximations. 
Note the varying quality of fit. Right: Gaussian approximation (with ^i^ n2 ) — 1-06 and 0w to2 s = 
0.94) indicated by shaded area. 

(ra) — ■ p(m\J m )=Af(m;iJ mi ,aD 



l[m = max(a;i, X2)] 



k = l,...,N 



p(x t \3 s ) =Af(m;n B i,al i ) 



Figure 5: Factor graph representation of the inference problem on a finite set. The dashed "plate" 
represents N copies of generating variable nodes. 



with a new normalization constant Z 
p(m\2 c ) = Z N p(m\3 m ) J p(x\m)p(xfi s )dx 

= Z N ]\[(m; p m ,a m ) 



at, as 



where xu = (x%, . . . , Xi-\, Xi+\, . . . , xn). The conditional mean is 
Section 2.3.2] 



Xi 

p{{xj}j^i\xi,3 B ) Y[dxjdx 

^(x\ l ;fJ' g \i(x l ),'E s \ i )dx v 
(39) 



^2 / 5{m- x l )p{x l \3 g ) / • 
^(/UmSMfli) °m + ^^(m; /i ci , ofj / • • • 

see e.g. Bishop 2006 

(M\i( a; s) ) = Msj + ^flj'i^nii — Mfli) = Mflj + Pij—^-(Xi — flyi 

with the linear coefficient of correlation pij — ~E B ij/(cr g i<T g j). The conditional covar 
matrix is the Schur complement of = in S fl : 

E g \i,fej = Sflfcj — E flfc jCT Bi 2 Sgjj (41) 

In principle, it would be possible to follow the path laid out in the previous sections to calcu- 
late the first two moments of this distribution. However, while the univariate Gaussian CDF 
(essentially an evaluation of the error function) has computational cost comparable to eval- 
uating an exponential function, computationally efficient ways of calculating a multivariate 



(40) 
■iancc 
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Gaussian CDF are not generally available. So this approximation would need to involve an 
undesirable numerical integration. 



3.2 A Heuristic Approximation 



Another, cheaper option is to use an iterative procedure initially proposed by Clark 1961 
The idea is to start out with the approximation for only two of the generating variables. 
W.l.o.g., let these be X\ and X2, resulting in m( 12 ) = max(xi, x 2 )- Next, estimate mn23) — 
max(x3, m/12)) an d so on up to N y For the intermediate maxima, the likelihoods 



presented in Equation (34 1 suffice, and the prior is included in the last step (using Equation 
(25 1) to gain an approximate posterior over the maximum of the whole set. Of course, this 
necessitates an analytic expression for the correlation coefficient Pi(i...i-i) between the i-th 
variable and the max over the preceding variables. This was derived by Clark. Adopted to 
the notation used here and made more explicit, his result is 

03(12) = CT(~ 12 ) (o-l/>31$(fc(12)) + 0"2P32$(-fyl2))) (42) 

where pij — Y>ij/<JiGj, the index q has been dropped for simplicity and fcm) = (Ml — 



M2)/ ' v 1 a \ + a 2 * s the simplified version of k\ arising from Equation (20 1 under a 



Using this, we can build a recursive algorithm to calculate Pi(\...j) with j < i as 

p, : , ■{"■"■* lij = 1 (43) 

( 3) i'i'l i...j))^Pij • *(* i ..../ \pi \ ... i : else 

this necessitates a list fc^-j = (fc(i 2 ), . . . fc(i...i-i)) which is available at the necessary point in 
time from the calculation of previous maxima over the preceding parts of the set. Note that 
calculating Pi{\...i-\) involves i — 1 recursive function calls, so building the full approximation 
over the max of N variables is of complexity 0(N 2 ), as might be expected (although there 
are only (TV — 1) uses of the results in Equation (p5|). If all correlation coefficients are the 
same, p^ = pVij, then the recursive evaluations can be re-used in consecutive evaluations 
and the complexity drops to O(N). 

3.2.1 Inverse Problem 

The same iterative scheme can be used to provide an approximation for the inverse problem's 
posterior. First, the list fe^-j is build as in the preceding section. Then, approximations to 
the posterior marginals are build iteratively, starting with g(a;jvp c ), ending with q(x2\3c) 
and q(x\\3 c ). At each intermediate step, we use the EP approximation Minka 2001] : To 



jet q(xi\3 c ), use <z(m(i...i) | J ro ) = g l (w(i...t)|3 c )/(j , (mM...i)|3 g ) as an approximation to the prior 
over the subset max, and q(mri ...j_i)|3 g ) as the approximation on the max over the subset 

Up to Xi-\. 



4 Discussion of the Approximation's Quality 

Figure [6] gives some intuition on the quality of the approximation. For the purpose of this 
comparison, uncorrelated Gaussians were used because this allows the analytic evaluation 
of the true posterior (the CDF factorises into individual one-dimensional CDFs). The fit is 
reasonably good if the beliefs over the xi are either very similar (Figure [6]top right), or if the 
beliefs are "separated" , in the sense that one of the Xj provides a dominant contribution to the 
overall mixture (top left). The fit becomes bad when the mixture has many modes (bottom 
left) or a strong asymmetry (bottom right). The corresponding worst case distributions 
shown here were generated by setting p S i = a + b~ l and a 2 ^ — b~ l (left, a = —1,6= 16) or 
Pgi = ci and crj^ = 1^ + 1 (right, c = — l,d = 16). More quantitatively, consider Equation 



(34 1 or Equation (16), the case of the max of only two Gaussians. The two cases of good fit 



described above correspond to 

1. one mixture component dominating the mixture 



A ,, = . _ "' ;J > (44) 
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Figure 6: Quality and failure modes of the EP approximation. Max of five uncorrelated Gaussians. 
Top row: examples of good fits. Left: well separated beliefs. Right: similar beliefs. Bottom row: 
worst case examples. Left: high certainty contributions within the center. Right: high uncertainty 
in one tail. In all plots, beliefs over the Xi as slim black lines. True posterior over m in thick red, 
approximation in thick dashed blue. For simplicity, p(m\3 m ) was set to an uninformative value. 
See text for details. 



The likelihood then has one clearly dominating Gaussian component and the fit is good. 
In this case, the inverse problem is also a good fit, as each of the generating variables 
xi,X2 has one dominating component in its posterior. 

2. the two mixture components being almost identical: 

Mfll ~ Mg2 and a gl » a g2 (45) 

the likelihood then consists of two roughly identical Gaussian components with roughly 
the same weights, and is therefore roughly Gaussian. However, the approximation is 
bad for the inverse problem here, as the true posterior marginals become bimodal (c.f. 
Figure [2j right). This effect is particularly pronounced if the mean of the prior and the 
likelihood differ significantly. 

These observations suggest a potential increase in the quality of the approximation to be 
gained from calculating all N(N — 1) weight-generators kij as defined in Equation (44) and 
iteratively choosing the pair ij with maximal kij. However, this re-ordering has to be updated 
after each incremental two-component max operation, involving a re-calculation of up to N 
correlation coefficients. It thus raises the complexity of calculating the approximation for the 
overall max from 0(N 2 ) to 0(N 3 ). Initial experiments suggest that the potential gain in fit 
is almost always negligible. 



5 Conclusion 

This technical report derived the first two moments of the posterior over the maximum of a 
pair of Gaussian variables, and over the posterior over the two generating variables. These 
moments can be used for approximate Inference on their own, or as part of a larger graphical 
model using Expectation Propagation. I have also shown how to extend the usefulness of these 
approximations to finite sets of Gaussian variables using a heuristic iterative approximation. 
The quality of the approximation depends on the location and precision of the belief over the 
generating variables relative to each other, but is always good enough to provide a meaningful 
point estimate and error measure. It is sufficiently robust to deal with inconsistent belief 
assignments and large numbers of generating variables (see Figure |7| . 
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Figure 7: Illustrative examples for the use of the approximation in EP message passing. p(xi\3 g ) in 
black dashed lines. p(m\3 m ) in red dotted. Marginals after EP message passing as corresponding 
solid lines. Top left: Max over 5 uncorrected variables. Only the two variables contributing 
significantly to the max change their beliefs. Top right: same as previous, but with pij = 0.9 for 
all ij. The change in belief over the dominating x,i now also effects the other beliefs, as expected. 
Bottom left: The approximation is well-behaved under inconsistent beliefs. p(m\3 m ) was set 
inconsistently low relative to the beliefs on the Xi (all = 0.2). Note that the belief over the 
largest Xi extends beyond the belief over m as a result of the moment-matching. Bottom right: 
The approximation is stable for large values of N. Maximum over 50 correlated normals, all 
were set to 0.5. 
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